Agent Harness Engineering — Synthesis
How to make AI coding agents work reliably: the emerging discipline of designing the system around the model.
The core thesis
The competitive advantage in AI-assisted engineering is not the model — it's the harness: the structured system of context, tools, guardrails, and feedback loops engineered around the agent.
"The new hierarchy won't be based on who codes the fastest — it will be based on who can orchestrate uncertainty without losing authority." — Karpathy
This synthesis pulls from five distinct sources at very different scales — from "what one engineer does in one repo" to "what MercadoLibre does across 20,000 developers" — and finds they're all converging on the same primitives.
Sources synthesized
| Source | Scale | Key contribution |
|--------|-------|------------------|
| Karpathy's "behind the new skill tree" | Theory | 4-level skill tree: Conditioning → Authority → Workflows → Compounding |
| Claude Code Setup Hook + Justfile pattern | Single repo | Deterministic scripts + agentic prompts for onboarding |
| Boris Cherny's thread-based framework | Single engineer | Thread taxonomy (P, C, F, B, L, Z) for scaling agent work |
| Julián de Angelis (MercadoLibre) | 20,000 devs | Custom rules, MCPs, skills, SDD, feedback loops at org scale |
| OpenAI Codex team | 3–7 engineers | AGENTS.md as table of contents, docs/ as encyclopedia, garbage-collection agents |
The four levers (Julián / MercadoLibre)
1. Custom rules (CLAUDE.md, AGENTS.md, .cursor/rules)
The most accessible lever. Living documents that get injected into the agent's context.
What belongs:
- Tech stack, architecture patterns, naming conventions
- Testing philosophy, common pitfalls, anti-patterns
- Commands (build, test, lint, deploy)
What doesn't belong:
- Entire API docs (wastes context)
- Obvious instructions ("write clean code")
- Contradictory rules
Best practices:
- Keep under 500 lines (context rot sets in at around 60% window utilization)
- Make them modular (split by concern)
- Use few-shot examples over abstract instructions
- Don't make everything always-on — use conditional loading
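A minimal sketch of a modular CLAUDE.md along these lines (the stack, commands, and paths are hypothetical placeholders, not a standard):

```markdown
# CLAUDE.md

## Stack
TypeScript, Next.js, Postgres. (placeholder: substitute your stack)

## Commands
- Build: `pnpm build`
- Test: `pnpm test`
- Lint: `pnpm lint`

## Conventions
- Prefer named exports; no default exports.
- Few-shot beats abstract: "API handlers live in `src/api/<resource>/route.ts`
  and return a typed `Result<T>`", not "structure code cleanly".

## Pitfalls
- Never edit generated files under `src/gen/`.

## Deep dives (loaded on demand, not always-on)
- Architecture: `docs/architecture.md`
- Migrations: `docs/patterns/migrations.md`
```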
2. MCP servers (Model Context Protocol)
Extend the agent beyond file read/write:
- Database queries, internal docs, API contracts
- CI/CD pipeline interaction, design specs
- Validation and testing of agent output
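In Claude Code, for example, servers can be registered in a project-level `.mcp.json`. A sketch (the server package name and token variable are placeholders; check your client's docs for the exact schema):

```json
{
  "mcpServers": {
    "internal-docs": {
      "command": "npx",
      "args": ["-y", "@acme/mcp-internal-docs"],
      "env": { "DOCS_API_TOKEN": "${DOCS_API_TOKEN}" }
    }
  }
}
```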
3. Skills
On-demand context injection + executable logic. Only a short description stays in context; full content loads when invoked.
Two flavors:
- Reference skills — inject knowledge (conventions, patterns, domain context)
- Task skills — step-by-step instructions for specific actions
Can bundle scripts, run in isolated subagents, compose into pipelines.
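A reference-skill sketch in the SKILL.md format, where only the frontmatter description stays in context until the skill is invoked (the skill name and rules are illustrative):

```markdown
---
name: db-migrations
description: Conventions for writing and reviewing Postgres migrations in this repo. Use when adding or changing schema.
---

# Database migrations

1. Generate the migration with `just db-migrate <name>`.
2. Every migration must be reversible; always ship a down step.
3. Never rename a column in place: add, backfill, swap reads, then drop.
```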
4. Spec-driven development (SDD)
The spec becomes the harness — it engineers the entire context window in one shot. Consolidates custom rules, step-by-step guidance, and acceptance criteria into a single artifact.
Use the agent to write the specs too.
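One possible spec skeleton; the headings are illustrative, not a standard:

```markdown
# Spec: <feature name>

## Problem & goal
What we are solving, for whom, and what "done" means.

## Constraints
Stack, performance budgets, security rules (inherits CLAUDE.md).

## Plan
Step-by-step implementation order, with checkpoints after each step.

## Acceptance criteria
- [ ] Behavior X covered by test Y
- [ ] Lint, types, and CI green
```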
OpenAI's additions
AGENTS.md as table of contents (~100 lines)
Don't make AGENTS.md the encyclopedia. Make it the map that points to deeper sources:
```
AGENTS.md (100 lines) → docs/architecture.md
                      → docs/patterns/
                      → docs/decisions/
```
The agent always reads the map; it reads the detail only on demand.
Mechanical enforcement
Don't just suggest architectural constraints — enforce them (a sketch follows the list):
- Custom linters that catch violations
- Structural tests validating dependency layers
- CI validation preventing architectural decay
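A sketch of such a structural test in Python, assuming a layout where `src/domain/` must never import from `src/infra/` (both paths are assumptions about the repo):

```python
# test_architecture.py: a structural test that fails CI on layering violations.
import pathlib
import re

# Assumption about repo layout: src/domain/ must never import src/infra/.
FORBIDDEN = re.compile(r"(?:from|import)\s+[\w.]*\binfra\b")

def test_domain_does_not_import_infra():
    violations = []
    for path in pathlib.Path("src/domain").rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if FORBIDDEN.search(line):
                violations.append(f"{path}:{lineno}: {line.strip()}")
    # Any hit is an architectural violation, reported with file and line.
    assert not violations, "domain imports infra:\n" + "\n".join(violations)
```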
Garbage-collection agents
Agents that run periodically to find:
- Stale documentation
- Violated architectural constraints
- Inconsistencies between docs and code
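One way to run these without a manual trigger is a scheduled CI job that invokes the agent headlessly. A hedged GitHub Actions sketch (the npm package and `claude -p` headless mode exist; the prompt, skill name, and auth details are assumptions about your setup):

```yaml
# .github/workflows/harness-gc.yml: weekly garbage-collection run.
name: harness-gc
on:
  schedule:
    - cron: "0 6 * * 1"   # Mondays, 06:00 UTC
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g @anthropic-ai/claude-code
      - run: claude -p "Run the audit-claude-md skill and report stale docs, violated constraints, and doc/code drift."
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```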
Karpathy's skill tree (4 levels)
Level 1 — Conditioning (steering)
| Skill | What it means |
|-------|---------------|
| Intent specification | Tight problem contracts (purpose, audience, constraints) |
| Context engineering | What goes in / out of context window, ordering, summarization |
| Constraint design | Output formats, schemas, rubrics, tool access, budgets |
Level 2 — Authority (ownership without authorship)
| Skill | What it means |
|-------|---------------|
| Verification design | How does truth enter the loop? (deterministic checks, human review) |
| Provenance | Sources, citations, traceability as first-class objects |
| Permissions | Least privilege, deterministic boundaries, audit trails |
Level 3 — Workflows (scaling intelligence)
| Skill | What it means |
|-------|---------------|
| Pipeline decomposition | Intermediate artifacts, checkpoints, local vs global failures |
| Failure-mode taxonomy | Context missing? Retrieval wrong? Tool fail? Hallucination? |
| Observability | Tool-call traces, inputs, documents retrieved, timing, cost |
Level 4 — Compounding (durable leverage)
| Skill | What it means |
|-------|---------------|
| Evaluation harnesses | Golden sets, regression tests, scorecards, thresholds |
| Feedback loops | Draft → critique → revise → recheck → ship |
| Drift management | Versioning, auditability, treating work as production infrastructure |
Thread types (Boris Cherny framework)
| Thread | Name | Pattern | Best for |
|--------|------|---------|----------|
| Base | Single | Prompt → Work → Review | Simple tasks |
| P | Parallel | N prompts simultaneously | Independent subtasks |
| C | Chained | Work → Review → Continue → Work | High-risk, migrations |
| F | Fusion | Same prompt → N agents → pick best | Prototyping, confidence |
| B | Big | Agent → subagents → combined result | Complex multi-file tasks |
| L | Long | High autonomy, hours duration | Background work |
| Z | Zero touch | No review needed | Maximum trust, fully validated |
"Tool calls roughly equal impact. Increase tool calls to increase output."
Scale by: more threads, longer threads, thicker threads (agents calling agents), fewer checkpoints.
The feedback loop
Tests, linters, type checkers, build scripts — every tool that produces a pass/fail signal becomes a feedback mechanism for self-correction.
Stop hooks are the most powerful mechanism: the agent cannot finish until checks pass. Not a suggestion — an enforced gate.
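A sketch of such a gate as a Stop hook in `.claude/settings.json`, based on Claude Code's hooks feature (exact schema may differ by version; `just check` is an assumed recipe that runs tests and lint, and exit code 2 is the hook convention that blocks and feeds stderr back to the agent):

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "just check || { echo 'checks failing, keep fixing' >&2; exit 2; }"
          }
        ]
      }
    ]
  }
}
```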
The Ralph Wiggum pattern: Loop an agent with deterministic validation. Agents + code beats agents alone.
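A minimal sketch of that loop, assuming a `claude` CLI and a `just test` recipe on the PATH (both are assumptions about the setup):

```python
# ralph_loop.py: loop the agent against a deterministic check until it passes.
import subprocess
import sys

MAX_ROUNDS = 10  # hard cap so a stuck agent cannot loop forever

def checks_pass() -> bool:
    """Deterministic validation gate: here, `just test`."""
    return subprocess.run(["just", "test"]).returncode == 0

for round_ in range(MAX_ROUNDS):
    if checks_pass():
        print(f"green after {round_} fix round(s)")
        break
    # Headless agent invocation; the prompt is illustrative.
    subprocess.run(["claude", "-p", "Run `just test`, read the failures, and fix the code."])
else:
    sys.exit("Validation never passed within the budget; escalate to a human.")
```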
Practical implementation — what we built at Fuul
| Lever | Implementation |
|-------|----------------|
| Custom rules | CLAUDE.md per repo (6 repos), AGENTS.md for cross-tool (4 repos), shared template |
| MCPs | Slack, Telegram, Pipedrive, Notion, Linear, Granola |
| Skills | 54 skills across 9 departments + 4 new engineering skills |
| SDD | /feature-spec skill + plans/ workflow in every repo |
| Feedback loop | Validation Loops in CLAUDE.md + PR review template |
| Compounding | Compound learning guide + docs/solutions/ + /incident-to-claude-md |
| Garbage collection | /audit-claude-md, /sync-claude-md (manual trigger) |
Architecture
```
fuul-agents-workspace (shared brain)
├── .claude/context/engineering/
│   ├── claude-md-template.md      ← shared standard
│   ├── agents-md-standard.md      ← cross-tool guide
│   ├── pr-review-templates/       ← parameterized review
│   └── compound-learning-guide.md ← learning loop
└── .claude/skills/engineering/
    ├── audit-claude-md/           ← garbage collection
    ├── improve-claude-md/         ← auto-improvement
    ├── sync-claude-md/            ← cross-repo consistency
    └── incident-to-claude-md/     ← incident → prevention
```

Each code repo:

```
├── CLAUDE.md        ← Claude-specific (conditioning)
├── AGENTS.md        ← tool-agnostic (cross-tool)
├── docs/solutions/  ← compound learning
└── .cursor/rules/   ← Cursor-specific
```
Key principles
- Separate generation from decisioning — the model generates, the workflow / system / human decides.
- Context is finite — every token wasted on irrelevant rules is a token not available for code.
- Rules are living documents — every agent mistake is a chance to improve the harness.
- Mechanical enforcement > suggestions — linters and tests beat anti-pattern lists.
- On-demand > always-on — load detailed context only when relevant (skills, docs/).
- The harness compounds — incident → learning → prevention → better harness.
Open questions
- When does AGENTS.md maintenance burden exceed its cross-tool value for small teams?
- How to implement setup hooks across repos without over-engineering for 2–5 engineers?
- Should heavy CLAUDE.md patterns (600+ lines) be split into docs/ with CLAUDE.md as the map?
- How to schedule garbage-collection agents (audit/sync) without manual triggers?
Connection points
- The harness pattern is the foundation agent-orchestrator is built on — every agent it spawns is a CLAUDE.md harness template plus a tool surface plus an eval gate.
- Pairs with Karpathy Autoresearch — Deep Research Report — the harness gives you reliability for one shot; autoresearch gives you reliability over hundreds of shots.
- The eval-platforms research is the L4 (compounding) end of this stack: harnesses that measure themselves are what stop the harness from rotting.